Abstract


Background: Amazona vittata is a critically endangered Puerto Rican endemic bird, the only surviving native parrot
species in United States territory, and the first parrot in the large Neotropical genus Amazona to be studied
on a genomic scale.


Findings: In a unique community-funded project, DNA from an A. vittata female was sequenced using an
Illumina HiSeq platform, resulting in a total of ~42.5 billion nucleotide bases. This provided approximately 26.89x
average coverage depth at the completion of this funding phase. Filtering followed by assembly resulted in
259,423 contigs (N50 = 6,983 bp, longest = 75,003 bp), which were further scaffolded into 148,255 fragments
(N50 = 19,470 bp, longest = 206,462 bp). This provided ~76% coverage of the genome based on an estimated size of
1.58 Gb. The assembled scaffolds allowed basic genomic annotation and comparative analyses with other available
avian whole-genome sequences.








Conclusions: The current data represent the first genomic information from A. vittata, obtained through work carried
out with a unique source of funding. This analysis further provides a means for directed training of young researchers
in genetic and bioinformatics analyses and will facilitate progress towards a full assembly and annotation of the
Puerto Rican parrot genome. It also adds extensive genomic data to a new branch of the avian tree, making it useful for
comparative analyses with other avian species. Ultimately, the knowledge acquired from these data will contribute
to an improved understanding of the overall population health of this species and aid in ongoing and future
conservation efforts.
Table 1 Average coverage of the Puerto Rican parrot genome in the current study based on the predicted genome size of 1.58 Gb [1]

Sample                   Sequence information  Total bases      Read count    Coverage  Total
Pa9a (~300 bp inserts)   Pa9a_1                13,496,744,938   133,631,138
                         Pa9a_2                13,496,744,938   133,631,138   17.08X
Pa9a (~2.5 kbp inserts)  Pa9a-MP_1             7,743,004,915    76,663,415
                         Pa9a-MP_2             7,743,004,915    76,663,415    9.80X     26.89X
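
The coverage values follow from dividing the number of sequenced bases by the predicted genome size. The short sketch below reproduces that arithmetic from the per-file base counts in Table 1 and the 1.58 Gb estimate [1]; the variable and function names are illustrative and are not part of the original analysis pipeline. Under these inputs it gives 17.08x for the short fragment library, 9.80x for the mate pairs (in line with the 9.8x quoted in the text) and 26.89x in total.

```python
GENOME_SIZE_BP = 1.58e9  # predicted genome size from [1]

# Total bases per read file, as listed in Table 1 (two files per library).
LIBRARY_BASES = {
    "paired_end_300bp": [13_496_744_938, 13_496_744_938],
    "mate_pair_2.5kbp": [7_743_004_915, 7_743_004_915],
}


def coverage_depth(total_bases, genome_size_bp=GENOME_SIZE_BP):
    """Average coverage depth = sequenced bases / genome size."""
    return total_bases / genome_size_bp


# Coverage contributed by each library, then by all reads together.
per_library = {
    name: coverage_depth(sum(bases)) for name, bases in LIBRARY_BASES.items()
}
total_coverage = coverage_depth(
    sum(sum(bases) for bases in LIBRARY_BASES.values())
)

for name, depth in per_library.items():
    print(f"{name}: {depth:.2f}x")
print(f"total: {total_coverage:.2f}x")
```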





Data description 

A locally funded genomic sequencing project provided the first phase of genome sequencing of the Puerto Rican Parrot (Amazona vittata) (see Developing of the Local Community Involvement in Additional file 1). DNA was purified from a female A. vittata blood sample (see Additional file 2: Table S1), and sequencing was initiated with the construction of two genome libraries: the majority of sequencing used a short fragment library (~300 bp inserts), and scaffolds were generated using a long fragment library (~2.5 kb inserts). Raw Illumina HiSeq reads were processed and filtered using the Genome Analyzer Pipeline software (as per the manufacturer’s instructions at default parameters). Of the 309,060,168 paired-end reads and the 180,079,956 mate-pair reads, 86.48% and 85.14%, respectively, passed QC, using the condition that if one read from a pair failed the QC, the entire pair was filtered out. Based on the total number of base pairs generated (see Additional file 3: Table S2) and the predicted genome size of 1.58 Gb [1], we calculated a total genome coverage of 26.89x depth, with 17.08x coverage for short fragment reads and 9.8x for mate pairs (Table 1 and Additional file 3: Table S2) (see Sample Collection and Genome Sequencing in Additional file 1).

We carried out two separate de novo assemblies, using Ray [2] (Table 2) and SOAPdenovo [3] (Additional file 4: Table S3), and selected the Ray assembly for use in all further analyses. Our genome coverage was approximately 76%, which might be slightly overestimated, given that some of the scaffolds may overlap and could not be properly assembled (see Assembly in Additional file 1). We evaluated the assembly by comparing it against the entire collection of transcripts listed for G. gallus in the NCBI Entrez Gene database using local BLAST [4] and found that >70% of the chicken transcripts were present, and as many as 11% of scaffolds shared similarity with at least one G. gallus sequence at an average density of 1.39 genes/kbp (Table 3; Additional file 5: Figure S1).

RepeatMasker software (http://www.repeatmasker.org) was used to search scaffolds for the known repeat classes, and known repeats were found on 59% of the scaffolds (see Annotation in Additional file 1).

Table 2 Results of the genome assembly by Ray [2]

Category   Metric        ≥100 nt         ≥500 nt
Contigs    Number        358,398         259,423
           Total length  1,137,438,369   1,116,807,713
           Average       3,173           4,304
           Largest       75,003          75,003
           Median        1,637           2,774
           N50           [illegible]     6,983
Scaffolds  Number        245,947         148,255
           Total length  1,184,594,388   1,164,566,833
           Average       4,816           7,855
           Largest       206,462         206,462
           Median        1,048           2,913
           N50           19,033          19,470
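
For readers unfamiliar with the N50 statistic reported for the contigs and scaffolds, it is the length L such that sequences of length L or longer together contain at least half of all assembled bases. The following is a minimal sketch of that standard definition; it is not Ray's own implementation, and the toy lengths are invented purely for illustration.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sequences are visited from longest to shortest; the N50 is the length
    of the sequence at which the running total first reaches half of the
    total assembled length.
    """
    if not lengths:
        raise ValueError("no sequence lengths given")
    half_total = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


# Hypothetical contig lengths (not taken from the A. vittata assembly).
toy_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50(toy_lengths))
```

Applied to the real assembly, the input would be the list of contig or scaffold lengths read from the FASTA files deposited with the data.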
In addition, we used manual annotation, both by annotating scaffolds for gene and repeat elements and by annotating known genes, to validate the high-throughput annotation, and using this, we designed and carried out a student development program (see Genome Annotation and Education in Additional file 1).

Table 3 Annotation summary

Scaffolds mapped to:      Scaffolds           mRNAs*                            Repeats
                          N        (%)#       N       (%)†  % of the scaffold   N        (%)†  % of the scaffold
G. gallus genome only     53,345   22%        1,256   5%    8%                  88,157   76%   77%
Unmapped                  105,030  43%        1,429   2%    22%                 125,470  48%   19.4%
T. guttata genome only    26,078   11%        4,206   21%   7%                  87,592   93%   2.1%
Mismatched                54,621   22%        12,030  27%   2%                  266,478  98%   1.0%
G. gallus and T. guttata  6,873    3%         1,426   26%   3%                  32,994   98%   1.2%
Total                     245,947  100%       20,347  11%   4%                  600,691  59%   4.3%

* mRNAs are from G. gallus.
# Percentage values are from total number of scaffolds.
† Percentage values are from the number of scaffolds in that category.

Figure 1 Density of the A. vittata scaffolds that shared similarity with fragments of chicken and zebra finch genomes. (Top) Chicken (G. gallus) genome (per Mbp) and (Bottom) zebra finch (T. guttata) genome (per Mbp). Different chromosomes are represented by different colors as shown in the legend on the right. Chromosomal locations, lengths and quality of alignments to the two genomes by BLAST are presented in Additional file 6: Table S4.

Comparative analyses of the A. vittata scaffolds against the chicken (Gallus gallus) [5] and zebra finch (Taeniopygia guttata) [6] genomes using local BLAST [4] resulted in 93.4 Mbp of total length of alignments to the chicken genome with 82.7% identity on average (average bit score 577.3), and 41.7 Mbp of total length of alignments to the zebra finch genome with 84.5% identity on average (average bit score 431.1).
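
Summary figures of this kind (total aligned length, mean identity, mean bit score) can be derived directly from BLAST tabular output. The sketch below assumes the hits were written with the default 12-column tabular format (`-outfmt 6` in BLAST+) and that the averages are simple per-alignment means; the source does not state either detail, so both are assumptions, and the two example rows are invented.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
OUTFMT6_FIELDS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def summarize_hits(handle):
    """Summarize tab-separated BLAST hits read from a file-like object."""
    reader = csv.DictReader(handle, fieldnames=OUTFMT6_FIELDS, delimiter="\t")
    count = 0
    total_length = 0
    identity_sum = 0.0
    bitscore_sum = 0.0
    for row in reader:
        count += 1
        total_length += int(row["length"])
        identity_sum += float(row["pident"])
        bitscore_sum += float(row["bitscore"])
    if count == 0:
        return {"alignments": 0, "total_length_bp": 0,
                "mean_identity": None, "mean_bitscore": None}
    return {
        "alignments": count,
        "total_length_bp": total_length,
        "mean_identity": identity_sum / count,
        "mean_bitscore": bitscore_sum / count,
    }


# Two made-up alignments, only to show the expected input layout.
EXAMPLE = (
    "scaffold-1\tchr1\t80.0\t1000\t200\t0\t1\t1000\t5000\t5999\t1e-50\t500\n"
    "scaffold-2\tchr2\t90.0\t500\t50\t0\t1\t500\t100\t599\t1e-40\t300\n"
)
summary = summarize_hits(io.StringIO(EXAMPLE))
print(summary)
```

A length-weighted mean identity would be an equally reasonable choice; which of the two the reported averages correspond to is not specified in the text.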

Figure 2 Proportion of sequences with some similarity across the two avian genomes (G. gallus and T. guttata). A. vittata scaffolds are classified into five categories: (A) unmapped: those for which no similar sequence was found; (B) chicken only: those that shared similarity only with a fragment of the G. gallus genome; (C) finch only: those that shared similarity only with a fragment of the T. guttata genome; (D) mismatched: those that shared similarity with sequences of both the G. gallus and T. guttata genomes but mapped to different chromosomes in the two species; (E) matched: those that mapped to the same chromosome in the reference genomes of the two avian species. Proportions are represented as totals (pie chart), absolute numbers (top) and proportions per chromosome (bottom). The associated data are provided in Additional file 9: Table S5.

The top BLAST alignments were sorted by the average of their locations, and their frequencies were calculated in 1 Mbp bins and plotted along all of the chromosomes for both G. gallus and T. guttata genomes using Circos [7] (Figure 1). The chicken genome coverage was higher (109 scaffolds per Mbp in chicken on average vs. 72 in zebra finch), and the chicken genome also had more locations with higher genome coverage. As many as 57% of the scaffolds could be partially aligned to one or both of the genomes: 21.7% aligned only to G. gallus, and 10.6% aligned exclusively to T. guttata, while 25% aligned to both genomes (Figure 2). These data are presented and summarized for chicken in Additional file 6: Table S4.A, for zebra finch in Additional file 7: Table S4.B, and the complete information in Additional file 8: Table S4.C.
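
The 1 Mbp binning used for the density plot can be sketched as follows. This is an illustration only: it assumes that "the average of their locations" means the midpoint of each alignment's start and end coordinates on the reference chromosome, and the function name and toy hits are hypothetical.

```python
from collections import Counter

BIN_SIZE_BP = 1_000_000  # 1 Mbp bins, as described in the text


def bin_alignments(hits, bin_size=BIN_SIZE_BP):
    """Count alignments per (chromosome, bin index).

    `hits` is an iterable of (chromosome, start, end) tuples giving the
    position of each top alignment on the reference genome. Each alignment
    is assigned to the bin that contains its midpoint.
    """
    counts = Counter()
    for chrom, start, end in hits:
        midpoint = (start + end) // 2
        counts[(chrom, midpoint // bin_size)] += 1
    return counts


# Invented alignment positions, used only to demonstrate the binning.
toy_hits = [
    ("chr1", 100, 900),
    ("chr1", 999_000, 1_002_000),
    ("chr1", 1_500_000, 1_500_800),
    ("chr2", 5, 400),
]
bin_counts = bin_alignments(toy_hits)
for (chrom, bin_index), n in sorted(bin_counts.items()):
    print(chrom, bin_index, n)
```

The resulting per-bin counts are the kind of table that a plotting tool such as Circos can take as a histogram track.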

Figure 3 Synteny of alignment of the A. vittata scaffolds to two avian reference genomes (G. gallus and T. guttata). The connecting lines show the proportion of scaffolds that mapped to T. guttata chromosomes on the left side to G. gallus chromosomes on the right side. The chromosomes are shown in order from top to bottom and designated in the same color for both species. For simplicity, different colors are used only for the three largest chromosomes. Chromosome 1 in G. gallus corresponds to chromosomes 1, 1A and 1B in T. guttata shown [...]

Although a large proportion of scaffolds shared some similarity with the two avian genomes, there was also discordance, as only 12.6% of the scaffolds (2.8% of the total number of scaffolds) aligned to the same chromosome in both species (Figure 2, top and Additional file 9: Table S5), and the proportion of discordance varied across chromosomes, with the lowest value on chromosome 11 (Figure 2, bottom and Additional file 9: Table S5). While this lack of synteny could point to extensive rearrangements during the evolutionary history, the proportions of scaffolds discordantly aligned between chromosomes seemed to be distributed similarly relative to chromosome lengths, indicating a significant random component (Figure 3). To test this, we selected the 200 longest scaffolds and independently queried their 500 bp ends against the chicken genome. Of these, only 10 scaffolds (5%) showed discordance, with their opposite ends aligning to two or more different chicken chromosomes (see Comparative Analysis in Additional file 1).
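
The distinction between concordant and discordant scaffolds rests on the five categories defined for Figure 2. A minimal sketch of that classification is given below, under two assumptions that are not confirmed by the text: each scaffold is represented by the chromosome of its top hit in each reference genome (or `None` when there is no hit), and zebra finch chromosomes 1A and 1B are treated as counterparts of chicken chromosome 1, following the Figure 3 caption. The scaffold names and hits are invented.

```python
from collections import Counter

# Assumption: finch 1A and 1B are counted as matching chicken chromosome 1,
# based on the correspondence stated in the Figure 3 caption.
FINCH_TO_CHICKEN = {"1A": "1", "1B": "1"}


def classify_scaffold(chicken_chrom, finch_chrom):
    """Assign a scaffold to one of the five Figure 2 categories.

    Each argument is the chromosome name of the scaffold's top hit in that
    reference genome, or None if the scaffold had no hit there.
    """
    if chicken_chrom is None and finch_chrom is None:
        return "unmapped"
    if finch_chrom is None:
        return "chicken only"
    if chicken_chrom is None:
        return "finch only"
    if FINCH_TO_CHICKEN.get(finch_chrom, finch_chrom) == chicken_chrom:
        return "matched"
    return "mismatched"


# Hypothetical top hits: scaffold -> (chicken chromosome, finch chromosome).
toy_hits = {
    "scaffold-a": ("1", "1A"),
    "scaffold-b": ("2", "5"),
    "scaffold-c": ("3", None),
    "scaffold-d": (None, "Z"),
    "scaffold-e": (None, None),
}
category_counts = Counter(
    classify_scaffold(chicken, finch) for chicken, finch in toy_hits.values()
)
proportions = {cat: n / len(toy_hits) for cat, n in category_counts.items()}
print(proportions)
```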

In summary, these data represent the first assembly of a genome sequence for a parrot endemic to the United States, and also the first genome of a species from the diverse and ecologically important genus, Amazona, native to South America and the Caribbean. The assembled sequence provides a starting point towards completing and annotating a draft genome sequence. The data available at this coverage will be helpful in designing future sequencing efforts, and can also be used for annotation and comparative genomic studies across the growing amount of avian genome data [5,6,8], which is essential given the growing rate of extinction among avian species worldwide.


Availability of supporting data 

The raw reads are available at the ENA (accession #PRJEB225). Scaffolds and the assembly parameters have been submitted to GenBank (accession #PRJNA171587), and all data, including FASTA files of contigs, scaffolds, corresponding assembly parameters, and annotation data, are available in GigaDB [9]. The links to all the supplementary tables and databases are listed in Additional files 2, 3, 4, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, and 16 and can also be accessed at http://genomes.uprm.edu/gigascience/Supplementary Tables/.


Additional files 





Additional file 1: Supplementary materials. 


Additional file 2: Table S1. Quality and volume of four DNA samples extracted from whole blood of two Amazona vittata parrots selected for the genome sequencing.

Additional file 3: Table S2. Results of the genome sequencing (Illumina HiSeq, Axeq Technologies). Pa9a_1 and Pa9a_2 represent the opposite ends of the 300 bp short reads, and the Pa9a-MP_1 and Pa9a-MP_2 are the 2,500 bp mate pairs (MP). All sequences were 101 bp long.

Additional file 4: Table S3. Results of the genome assembly by SOAPdenovo [8].


Additional file 5: Supplementary figures. Figure S1. Venn diagram of the overlap between the number of A. vittata sca[...] transcripts from GenBank that were mapped to them by BLAST. Figure S2. A single example of chimera detected on scaffold-74754 after visual inspection of reads mapped to 100 largest scaffolds. Figure S3. Percentage of scaffolds containing fragments with > 95% similarity to GenBank sequences. Figure S4. Comparison between categories of A. vittata scaffolds (described earlier in Figure 2): medians, Q1, Q3 and the extreme values. The mea[...] Table 3. A. Distribution of scaffold lengths; B. Distribution of de[...] genes mapped per kbp of scaffold length. C. Differences in the distribution of proportion of the length of t[...] distribution of proportion of the length of t[...]
Additional file 6: Table S4A. Summary of the alignment of A. vittata sequences to the G. gallus genome sequence containing only [...] alignment for each scaffold, its chromosomal position and quality scores.

Additional file 8: Table S4C. The database of the alignment information of A. vittata sequences to G. gallus and T. guttata genome sequence by BLAST.


Additional file 9: Table S5. Proportions of sequences with some similarity that mapped to chromosomes of two reference avian genomes (G. gallus and T. guttata).

Additional file 10: Table S6A. The summary of the database of GenBank sequences with more than 95% similarity with the parrot scaffolds.

Additional file 11: Table S6B. The database of GenBank sequences with more than 95% similarity with the parrot scaffolds found by BLAST.

Additional file 12: Table S7A. A map of G. gallus transcripts from NCBI Entrez Gene database that mapped to one of the A. vittata scaffolds.

Additional file 13: Table S7B. The database of alignments between G. gallus transcripts from NCBI Entrez Gene database and A. vittata scaffolds by BLAST.


Additional file 14: Table S8. Distribution of different cases of repetitive elements among different classes of A. vittata scaffolds.
Additional file 15: Table S9. Bioinformatics tools and outputs for scaffold and gene annotation.

Additional file 16: Table S10. An example of annotation output produced by a student in the Genome annotation class using A. vittata genome.